Content Chunkers

The DRUID KB Engine supports two content chunking (article extraction) methods: Basic and LLM. The method you choose determines how content is divided into chunks for processing and how much detail is extracted from documents.

Basic Content Chunking

This method extracts articles in chunks of 512 tokens (approximately 2,000 characters).

Use Case: Use this option for simple content extraction when you do not require deep semantic understanding. It’s ideal for smaller, less complex documents or when speed is a priority over nuanced content interpretation.
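
As a rough illustration of fixed-size chunking, the sketch below splits text into pieces of about 512 tokens. It is not DRUID's implementation: the function name basic_chunks and the 4-characters-per-token ratio are assumptions, chosen only because 512 tokens then comes out close to the 2,000 characters mentioned above.

```python
# Illustrative sketch only; DRUID's basic chunker may split text differently.
CHARS_PER_TOKEN = 4  # assumed average; real tokenizers vary by model and language


def basic_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Split text into consecutive pieces of at most max_tokens * CHARS_PER_TOKEN characters."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    size = max_tokens * CHARS_PER_TOKEN
    # Step through the text in fixed-size windows; the last piece may be shorter.
    return [text[i:i + size] for i in range(0, len(text), size)]


chunks = basic_chunks("a" * 5000)
```

A split like this ignores sentence and paragraph boundaries, which is why it is fast but less suited to content that needs semantic interpretation.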

LLM Content Chunking

LLM chunking uses generative AI to extract more meaningful and contextually relevant chunks. The language model identifies and extracts key pieces of content, which improves the relevance of each chunk compared with a fixed-size split.

Hint: Starting with DRUID 8.19, LLM-based chunking is enabled by default to extract contextually relevant content chunks. If the default LLM resource (such as Becus) is unavailable, DRUID automatically falls back to an alternative model like Azure OpenAI GPT-4o-mini, if it's configured. If no generative resource is activated on your tenant, DRUID defaults to the basic (non-AI) chunking method. Resetting advanced settings restores LLM chunking using the available resource.
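
The fallback order described in the hint can be summarized as a small decision function. This is a hedged sketch of the described behavior, not DRUID code; the function name, its parameters, and the returned labels are hypothetical.

```python
def pick_chunking_method(default_llm_available: bool,
                         fallback_llm_configured: bool) -> str:
    """Return the chunking method that applies, following the order in the hint."""
    if default_llm_available:
        return "llm:default"    # the default LLM resource is used
    if fallback_llm_configured:
        return "llm:fallback"   # an alternative configured model is used
    return "basic"              # no generative resource: basic (non-AI) chunking


method = pick_chunking_method(default_llm_available=False,
                              fallback_llm_configured=True)
```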

Adding a New LLM Chunker

To add a new LLM chunker:

Click Add new.
Select LLM as the chunker type.
Select the generative endpoint you want to use.
Configure the chunker parameters (see details below).
Click Save.

Adding Content Chunkers in DRUID Versions Prior to 8.4

In versions prior to DRUID 8.4, content chunkers can be added manually through the Advanced Settings JSON field. To do this:

Navigate to the Advanced Settings JSON section.
In the contentChunkers section, copy the structure of an existing chunker and paste it.
Set the caption parameter to define the name of your new chunker.
Save the changes, and your new chunker will appear in the Content Chunkers section of the UI.
Select LLM from the Type field.
Select the generative endpoint you want to use.
Configure the chunker parameters (see details below).
Click Save.
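
The copy-and-rename step can be pictured with the sketch below. Only the contentChunkers and caption names come from the steps above; the list layout, the "type" field, and the placeholder values are assumptions made for illustration, so check the structure against your own Advanced Settings JSON before editing it.

```python
import copy
import json

# Hypothetical settings fragment. Only "contentChunkers" and "caption" are named
# in the documentation; the "type" field and its value are placeholders.
settings = json.loads("""
{
  "contentChunkers": [
    {"caption": "Existing chunker", "type": "placeholder"}
  ]
}
""")


def add_chunker(current: dict, caption: str) -> dict:
    """Duplicate the first existing chunker entry and give the copy a new caption."""
    updated = copy.deepcopy(current)  # leave the original settings untouched
    new_entry = copy.deepcopy(updated["contentChunkers"][0])
    new_entry["caption"] = caption
    updated["contentChunkers"].append(new_entry)
    return updated


updated = add_chunker(settings, "My LLM chunker")
print(json.dumps(updated, indent=2))
```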

Configuring LLM Chunker Parameters

Caption

Customize the name of the content chunker as displayed in the user interface. A distinct caption helps you tell chunkers apart when you have multiple configurations.

Max Tokens and Max Lines

These parameters are automatically populated when you select the generative endpoint. They are specific to the generative model being used. For optimal performance, you should adjust these values based on the capabilities of the selected generative endpoint.

Both Max Tokens and Max Lines are limits that can be applied in text processing, but they serve different purposes.

Max Tokens limits the number of tokens (the sub-word units of text that a model processes) the model can extract from each chunk. It controls the length of a response in AI models to prevent excessively long outputs.

Note: Max Tokens cannot exceed the remaining space in the Model Context Window Size after accounting for the input tokens. If your input takes up most of the context window, the maximum number of tokens for the output will be reduced. The Max Tokens parameter is not available for LLM chunkers.
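
The context-window constraint in the note comes down to simple arithmetic, sketched below. The function and parameter names are hypothetical, and the numbers are examples rather than values from any specific model.

```python
def max_output_tokens(context_window: int, input_tokens: int,
                      requested_max_tokens: int) -> int:
    """Return the effective output limit once the input has used part of the window."""
    remaining = max(context_window - input_tokens, 0)  # space left after the input
    return min(requested_max_tokens, remaining)


# An input of 3,000 tokens in a 4,096-token window leaves 1,096 tokens,
# so a requested Max Tokens of 2,000 is effectively reduced.
effective = max_output_tokens(context_window=4096, input_tokens=3000,
                              requested_max_tokens=2000)
```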

Max Lines limits the number of lines to be extracted from each chunk regardless of token count. It structures the response into a fixed number of lines.

Hint: If you set both limits, output stops at whichever limit is reached first, Max Tokens or Max Lines.

Example: If you use a model with a 4,096-token limit, you might set Max Tokens to 2,000 and Max Lines to 30 so that each chunk is appropriately sized without overwhelming the model.
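
The "whichever comes first" behavior of the two limits can be sketched as follows. This is an assumption-laden illustration, not how DRUID or any model enforces the limits: the function name is hypothetical, and whitespace-separated words stand in for real tokens.

```python
def apply_limits(lines: list[str], max_tokens: int, max_lines: int) -> list[str]:
    """Keep whole lines until either the line limit or the token limit is reached."""
    kept: list[str] = []
    used = 0
    for line in lines[:max_lines]:      # Max Lines: never look past this many lines
        cost = len(line.split())        # assumption: one word counts as one token
        if used + cost > max_tokens:    # Max Tokens: stop before exceeding the budget
            break
        kept.append(line)
        used += cost
    return kept


sample = ["one two three"] * 10
by_lines = apply_limits(sample, max_tokens=100, max_lines=4)   # line limit hit first
by_tokens = apply_limits(sample, max_tokens=7, max_lines=10)   # token limit hit first
```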

Prompt

This parameter is part of the system's code and cannot be edited. It defines how the model interprets and processes the chunking task.

Keep Table Structure

This option allows you to exclude tables from LLM chunking and process them with DRUID’s proprietary chunking technology. Enable it to maintain the structure and readability of tables in the extracted content, so that tables are handled more precisely than with LLM-based chunking.

Note: This option is available in DRUID 8.4 and higher.

Hint: If your documents include structured tables or data that need to be accurately retained, consider enabling this option for better results.
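
Conceptually, this option routes tables and regular text to different chunkers. The sketch below shows only that routing idea. DRUID's table handling is proprietary and not described here, so the pipe-prefix table detection and the function name are purely illustrative assumptions.

```python
def split_tables(lines: list[str]) -> list[tuple[str, list[str]]]:
    """Group consecutive lines into ("table", [...]) and ("text", [...]) blocks."""
    blocks: list[tuple[str, list[str]]] = []
    for line in lines:
        # Assumption: a line starting with "|" belongs to a table.
        kind = "table" if line.lstrip().startswith("|") else "text"
        if blocks and blocks[-1][0] == kind:
            blocks[-1][1].append(line)   # extend the current block
        else:
            blocks.append((kind, [line]))  # start a new block
    return blocks


doc = ["Intro paragraph", "| a | b |", "| 1 | 2 |", "Closing paragraph"]
blocks = split_tables(doc)
```

In this picture, "text" blocks would go to the LLM chunker while "table" blocks would be kept intact and handled separately.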

Customizing Content Chunker LLM Instructions

By default, DRUID provides generic instructions for LLM content chunkers. You can customize these instructions directly in the UI.

To modify the content chunker LLM instructions:

In the KB Advanced Settings, click the Content chunker LLM instructions section.
Make your changes to the instructions as needed.

Click the Save & Close button at the bottom of the page.

In DRUID versions prior to 8.17, the Content chunker LLM instructions section is hidden by default.

To enable it:

Go to the Advanced Settings JSON section.
Under the featureFlags object, locate the ContentChunkerLlmInstructions property.
Set the caption value to a string, such as "null".

Click the Save & Close button at the bottom of the page.